where \(y_i^*\) is what we wish we could measure (say, the probability that \(y_i=1\)), though we can only measure \(y_i\). \(y^*\) is going to be our principal quantity of interest.
\(F\) is a nonlinear function, so the model is nonlinear in parameters. \(x\hat{\beta}\) is the linear prediction, but it is not the quantity of interest. Instead, the quantity of interest is \(F(x\hat{\beta})\), which is our estimate of \(y_i^*\).
Important Concept
The difference between this model and the OLS linear model is simply that we must transform the linear prediction, \(x\hat{\beta}\), by \(F\) in order to produce predictions. Put differently, we want to map our linear prediction, \(x\beta\) onto \(y^*\).
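A minimal sketch in R of that mapping, using made-up linear predictions; `plogis()` and `pnorm()` are the logistic and standard Normal CDFs that serve as \(F\) for the logit and probit:

```r
# hypothetical linear predictions, x %*% beta-hat
xb <- c(-3, -1, 0, 1, 3)

# logit: F is the logistic CDF; probit: F is the standard Normal CDF
plogis(xb)   # mapped onto (0, 1)
pnorm(xb)    # also bounded in (0, 1)
```

Either choice squeezes the unbounded linear prediction into the unit interval, which is what lets us read the result as a probability.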
Why leave the robust OLS model?
Why not use the OLS model when \(y\) is binary?
because we fail to satisfy the OLS assumptions.
because the residuals are not Normal.
because \(y\) is not Normal.
because \(y\) is limited.
Limited \(y\) variables are \(y\)s whose measurement is limited by the realities of the world. Such variables are rarely Normal, often not continuous, and often observable indicators of unobservable things; this is true of most binary variables.
Limited Dependent Variables
Why would we measure \(y_i\) rather than \(y_i^*\)?
Limited dependent variables are usually limited in the sense that we cannot observe the range of the variable or the characteristic of the variable we want to observe. We are limited to observing \(y_i\), and so must estimate \(y_i^*\).
unordered or nominal categorical variables: type of car you prefer: Honda, Toyota, Ford, Buick; policy choices; candidates in an election.
ordered variables that take on few values: some survey responses.
discrete count variables: number of episodes of scarring torture in a country-year, 0, 1, 2, 3, \(\ldots\), \(\infty\).
time to failure: how long a civil war lasts; how long a patient survives disease; how long a leader survives in office.
Binary dependent variables
Generally, we conceive of a binary variable as being the observable manifestation of some underlying, latent, unobserved continuous variable.
If we could adequately observe (and measure) the underlying continuous variable, we’d use some form of OLS regression to analyze that variable.
Why not use OLS?
\[ \mathbf{y}=\mathbf{X \beta} + \mathbf{u} \]
where we are principally interested in the conditional expectation of \(y\), \(E(y_{i}|\mathbf{x_{i}})\) where we want to interpret that expectation as a conditional probability, \(Pr(y=1|\mathbf{x_{i}})\); we focus on the probability the outcome occurs (i.e., \(y\) is equal to one).
Linear Probability Model
The linear probability model (LPM) is the OLS linear regression with a binary dependent variable.
The main justification for the LPM is that OLS is unbiased (by Gauss-Markov). But \(\ldots\)
predictions are nonsensical (linear, unbounded, measures of \(\hat{y}\) rather than \(y^*\)).
disturbances are non-normal, and heteroskedastic.
the mapping of \(x\beta\) onto \(y\) has the wrong functional form (linear).
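A small simulation (entirely made-up data) makes the first point concrete: when the true process is a logit, the LPM's fitted values escape the unit interval:

```r
set.seed(1)
n <- 500
x <- rnorm(n, 0, 2)
y <- rbinom(n, 1, plogis(1.5 * x))   # true model is a logit

lpm  <- lm(y ~ x)                    # linear probability model
phat <- fitted(lpm)

range(phat)                          # predictions stray outside [0, 1]
mean(phat < 0 | phat > 1)            # share of nonsensical predictions
```

These out-of-bounds fitted values are not probabilities, which is the first symptom that the linear functional form is wrong here.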
Running example - Democratic Peace data
As a running example, I’ll use the Democratic Peace data to estimate logit and probit models. These come from Oneal and Russett’s (1997) well-known study in ISQ. The units are dyad-years; the \(y\) variable is the presence or absence of a militarized dispute, and the \(x\) variables include a measure of democracy (the lowest of the two Polity scores in the dyad) and a set of controls.
ggplot(df, aes(x = olspreds)) +
  geom_density(alpha = .5) +
  labs(title = "Density of OLS Predictions", x = "Predictions", y = "Density") +
  theme_minimal() +
  geom_vline(xintercept = 0, linetype = "dashed")
Heteroskedastic Residuals
code
df <- data.frame(df, mols$residuals)
bucolors <- list("#005A43", "#6CC24A", "#A7DA92", "#BDBEBD", "#000000")
# plot density of residuals, colored by dispute
ggplot(df, aes(x = mols.residuals, fill = dispute)) +
  geom_density(alpha = .5) +
  labs(title = "Density of OLS Residuals", x = "Residuals", y = "Density") +
  theme_minimal() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = bucolors)
When is the LPM Reasonable?
On Linearity
In the linear model, \(\hat{y_i}=x_i\widehat{\beta}\). This makes sense because \(y = y^*\). Put differently, \(y\) is continuous, unbounded, (assumed) normal, and is an “unlimited” measure of the concept we intend to measure.
In binary models, \(y \neq y^*\), because our observation of \(y\) is limited such that we can only observe its presence or absence. We have two different realizations of the same variable: \(y\) is the limited but observed variable; \(y^*\) is the unlimited variable we want to measure, but cannot because it is unobservable.
The goal of these models is to use \(y\) in the regression in order to get estimates of \(y^*\). Those estimates of \(y^*\) are our principal quantity of interest in the binary variable model.
Linking \(x\widehat{\beta}\) and \(y^*\)
We can produce the linear prediction, \(x\widehat{\beta}\), but we need to transform it to produce estimates of \(y^*\). To do so, we use a link function to map \(x_i\beta\) onto the probability space, \(y^*\). This means \(\widehat{y_i} \neq x\widehat{\beta}\). Instead,
\[y^* = F(x_i\beta)\]
Where \(F\) is a continuous, sigmoid probability CDF. This is how we get estimates of our quantity of interest, \(y^*\).
Nonlinear change in Pr(y=1) across values of \(x\)
In the LPM, the relationship between \(Pr(y=1)\) and \(X\) is linear, so the rate of change toward \(Pr(y=1)\) is constant across all values of \(X\).
This means that the rate of change approaching one (or approaching zero) is exactly the same as the rate of change anywhere else in the distribution.
For example, the LPM implies that moving from .99 to 1.00 requires the same change in \(x\) as moving from .50 to .51; is this sensible for a bounded latent variable (a probability)?
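A quick check with the logistic CDF (values chosen purely for illustration) shows why this matters: the same one-unit change in the linear prediction moves the probability far more near the middle than near the bounds.

```r
# change in Pr(y=1) for a one-unit increase in the linear prediction,
# evaluated near the middle of the curve versus out in the tail
delta_mid  <- plogis(1) - plogis(0)   # near Pr = .5; about 0.23
delta_tail <- plogis(5) - plogis(4)   # near Pr = 1;  about 0.011

delta_mid
delta_tail
```

The sigmoid shape of \(F\) builds in exactly this flattening near the bounds, which the LPM's constant slope cannot do.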
So \(y\) is binary, and we’ve established the linear model is not appropriate. We might characterize the observed variable, \(y\), as binomial (hence the binomial parameter, \(\pi\)):
\[\ln L = \sum_{i=1}^{n} \left[ y_i \ln F(x_i\beta) + (1-y_i) \ln\left(1-F(x_i\beta)\right) \right]\]
This is the binomial log-likelihood function. We estimate it by maximum likelihood, just another technology (alongside OLS, Bayesian modeling, and others) for estimating unknowns.
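The binomial log-likelihood is easy to carry into R directly; here is a minimal sketch that maximizes it with `optim()` on simulated data and checks the answer against `glm()` (the data, coefficients, and starting values are all made up):

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.0 * x))   # simulated binary outcome

# binomial log-likelihood with a logit link F = plogis
loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# fnscale = -1 tells optim to maximize rather than minimize
mle <- optim(c(0, 0), loglik, control = list(fnscale = -1))

glm_fit <- glm(y ~ x, family = binomial)

mle$par          # close to the glm coefficients
coef(glm_fit)
```

The two sets of estimates agree because `glm()` is maximizing the same likelihood, just with a more specialized algorithm.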
So, between the logit and probit models, which do we choose?
Unless we have theoretical expectations that inform us about the data in the tails of the distribution (we usually don’t), it makes no practical difference which estimation procedure we select. Choose what you find most comfortable and most readily interpretable.
Notice there are no substantive differences between the logit and probit. As in the linear model, we can make statements about direction and significance looking at the model estimates. We can’t say much about magnitude because the coefficients are not transformed by \(F\) - this is where we turn to quantities of interest.
Quantities of Interest
In the linear model, the main quantity of interest is \(x_i\hat{\beta}\). Here, we need \(x_i\hat{\beta}\), but then need to transform by the link function, \(F\):
\(F(x_i\hat{\beta})\)
Measures of uncertainty about \(F(x_i\hat{\beta})\)
Confidence Intervals
End point transformation is straightforward as in the linear model:
estimate the model.
generate the linear prediction, \(x\hat{\beta}\).
generate the standard error of the prediction
generate linear predictions of the upper bound (\(x\hat{\beta}+c*se_p\)) and lower bound (\(x\hat{\beta}-c*se_p\)), where \(c\) is the critical value from the standard Normal, \(N(0,1)\) (e.g., 1.96 for a 95% interval).
generate the predictions at upper, lower, and central points by \(F(x\hat{\beta}\pm c*se_p)\) and \(F(x\hat{\beta})\)
plot the upper and lower quantities of interest, perhaps around the central point.
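The steps above can be sketched compactly; this uses simulated data and placeholder names (in practice, substitute your own fitted model and covariates):

```r
set.seed(7)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(x))
m <- glm(y ~ x, family = binomial)             # 1. estimate the model

nd <- data.frame(x = seq(-2, 2, 0.5))
pr <- predict(m, newdata = nd, type = "link",  # 2-3. linear prediction and
              se.fit = TRUE)                   #      its standard error

crit <- qnorm(0.975)                           # critical value, ~1.96
lo <- plogis(pr$fit - crit * pr$se.fit)        # 4-5. bound, then transform by F
hi <- plogis(pr$fit + crit * pr$se.fit)
p  <- plogis(pr$fit)

cbind(nd, p, lo, hi)                           # 6. ready to plot
```

Because \(F\) is monotone, transforming the endpoints preserves the ordering: the transformed bounds still bracket the transformed point prediction.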
Predictions
We do this exactly as in the linear model, with the one modification that we map the linear prediction, \(x_i\hat{\beta}\) onto the probability space by the link function. Here, for instance, is the at-means approach:
estimate the model
generate the linear prediction, \(x\beta\), using the at-means data, varying the \(x\) of interest, and model estimates, \(\widehat{\beta}\).
generate the standard errors of the linear predictions the usual way.
generate upper and lower bounds of the linear prediction.
map those linear predictions and the boundaries onto \(y^*\), the latent probability space: \(F(x \widehat{\beta})\)
plot the predictions and measures of uncertainty against the variable of interest.
Transformation by \(F\)
So you’ve got the three quantities we’ve been working with all semester:
\(x\widehat{\beta}\)
\(x\widehat{\beta} + 1.96*se\)
\(x\widehat{\beta} - 1.96*se\)
All transform the linear prediction and the standard error of the linear prediction. For the logit, transform by the logistic CDF:
\[F(x\widehat{\beta}) = \frac{\exp(x\widehat{\beta})}{1+\exp(x\widehat{\beta})}\]
Any of the prediction methods we’ve learned are the same here - just remember to transform the quantities by \(F\).
At-mean predictions (logit)
code
# summary(m1)
# confint(m1)
# new data frame for MEM prediction
mem <- data.frame(deml = c(seq(-10, 10, 1)), border = 0,
                  caprat = median(dp$caprat), ally = 0)
# type="link" produces the linear predictions; transform by hand below w/ EPT
mem <- data.frame(mem, predict(m1, type = "link", newdata = mem, se = TRUE))
mem <- cbind(mem,
             lb = plogis(mem$fit - 1.96 * mem$se.fit),
             ub = plogis(mem$fit + 1.96 * mem$se.fit),
             p  = plogis(mem$fit))
ggplot(mem, aes(x = deml, y = p)) +
  geom_line() +
  geom_ribbon(data = mem, aes(x = deml, ymin = lb, ymax = ub),
              fill = "grey70", alpha = .4) +
  labs(x = "Polity Score", y = "Pr(Dispute) (95% confidence interval)")
Average effects (logit)
Average effects are often a better choice because they represent the data more completely than a measure of central tendency can (as in the at-means approach). Here are average effects (using the logit estimates) across the range of polity, for pairs of states that share borders and those that do not.
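A sketch of that observed-values approach, using simulated stand-ins (in the running example you would substitute the fitted model `m1`, the `dp` data, and `deml` for the illustrative names here): for each value of the variable of interest, set it for every observation, predict, transform by \(F\), and average.

```r
set.seed(9)
n  <- 500
df <- data.frame(x = rnorm(n), z = rbinom(n, 1, 0.5))
df$y <- rbinom(n, 1, plogis(0.8 * df$x - 0.5 * df$z))
m   <- glm(y ~ x + z, family = binomial, data = df)   # stand-in logit model

xs <- seq(-2, 2, 0.5)
avg_p <- sapply(xs, function(v) {
  d <- df
  d$x <- v                            # set x to v for every observation
  mean(plogis(predict(m, newdata = d, type = "link")))  # average over sample
})

cbind(xs, avg_p)   # average predicted Pr(y=1) across the range of x
```

Each point averages over the observed values of the other covariates rather than pinning them at their means, which is what makes this a more complete summary of the data.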
Oneal, John R., and Bruce M. Russett. 1997. “The Classic Liberals Were Right: Democracy, Interdependence, and Conflict, 1950-1985.” International Studies Quarterly 41 (2): 267–94.